
more catching of build errors, dump of logs, etc.; add registry pod disk usage analysis #8591

Merged
merged 1 commit into openshift:master from buildExtTestDbg on May 7, 2016

Conversation

gabemontero
Contributor


@bparees @csrwng FYI - next round of debugging the extended-test registry disk quota error ... this catches some more tests' build failures and adds running of df/du inside the registry pod.
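For reference, a minimal sketch of the kind of df/du dump being added here; the pod label selector and the /registry mount path are assumptions about the default docker-registry deployment, not the exact code in this PR:

```bash
# Sketch only: capture disk usage from inside the registry pod after a failure.
# The deploymentconfig label and the /registry mount path are assumptions.
REGISTRY_POD=$(oc get pods -l deploymentconfig=docker-registry \
  -o jsonpath='{.items[0].metadata.name}')

# Filesystem-level view of the registry volume.
oc exec "$REGISTRY_POD" -- df -k /registry

# Per-repository breakdown of what is actually stored.
oc exec "$REGISTRY_POD" -- du -sk /registry/docker/registry/v2/repositories
```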

@gabemontero
Contributor Author

[testonlyextended][extended:core(builds)]

@gabemontero gabemontero self-assigned this Apr 21, 2016
@bparees
Contributor

bparees commented Apr 21, 2016

lgtm, needs @smarterclayton approval. And let's make sure we have a plan to strip all this out (aside from the waitforabuild calls, which seem like a good general improvement) once we get to the bottom of this.

@smarterclayton
Contributor

Why do we need to merge this vs just running testonlyextended?

@gabemontero
Contributor Author

@smarterclayton I'm in agreement with not merging at this time, especially given the latest test results. The df/du of the registry pod, along with the df of the ec2 instance, shed some light on the situation, though the root cause is still TBD.

Specifically:

  1. The registry fs does in fact seem to be allocated the entire fs from the host system, as @bparees stipulated earlier today. I'll paste a couple of relevant entries here:

ec2 df:
/dev/mapper/docker--vg-openshift--xfs--vol--dir 1038336 295328 743008 29% /mnt/openshift-xfs-vol-dir

registry df:
/dev/mapper/docker--vg-openshift--xfs--vol--dir 1038336 295328 743008 29% /registry

Those two entries were immediately after a push to the registry failed with the same message reported earlier:

time="2016-04-21T21:51:06.639377806Z" level=error msg="response completed with error" err.code=UNKNOWN err.detail="filesystem: write /registry/docker/registry/v2/repositories/extended-test-build-image-source-eoxy7-hoeui/imagesourceapp/_uploads/30d40477-3875-4f92-b607-1e6b418f4d79/data: disk quota exceeded" err.message="unknown error" go.version=go1.6 http.request.host="172.30.131.224:5000" http.request.id=941f71c2-a05f-4830-a6d7-cb885abcb54f http.request.method=PATCH http.request.remoteaddr="172.18.12.132:34979" http.request.uri="/v2/extended-test-build-image-source-eoxy7-hoeui/imagesourceapp/blobs/uploads/30d40477-3875-4f92-b607-1e6b418f4d79?_state=KKbjEmD56jWJfd8GXI1LpPcAM2QkR8GiEg2V9UrfdlN7Ik5hbWUiOiJleHRlbmRlZC10ZXN0LWJ1aWxkLWltYWdlLXNvdXJjZS1lb3h5Ny1ob2V1aS9pbWFnZXNvdXJjZWFwcCIsIlVVSUQiOiIzMGQ0MDQ3Ny0zODc1LTRmOTItYjYwNy0xZTZiNDE4ZjRkNzkiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMTYtMDQtMjFUMjE6NTA6NTguMTkxMzI5MjVaIn0%3D" http.request.useragent="docker/1.9.1 go/go1.4.2 kernel/3.10.0-229.7.2.el7.x86_64 os/linux arch/amd64" http.response.contenttype="application/json; charset=utf-8" http.response.duration=8.298964494s http.response.status=500 http.response.written=293 instance.id=63aa5205-b8fb-4cf4-ad4e-12d25be0054b vars.name="extended-test-build-image-source-eoxy7-hoeui/imagesourceapp" vars.uuid=30d40477-3875-4f92-b607-1e6b418f4d79

The du /registry output on the registry pod matched up with the data from df. A mixture of image entries, secret entries, etc. A decent number of files, none overly big.

So either the entry that causes the problem is exceptionally large, and is then cleaned up as part of error processing, hence gone before we can run df/du, or the message details about disk quota being violated are misleading somehow (or we are misinterpreting what it means).

As @bparees and I discussed earlier, I think we've done due diligence here. It is time to pull in the team that owns the registry, and get their help in deciphering what is going on. I'm assuming we can add whatever debug they would need into this PR, and rerun the extended test.

Thoughts?

@gabemontero
Contributor Author

@pweil- during the build/image extended tests (both with the periodic jobs from https://ci.openshift.redhat.com/jenkins/job/origin_extended_build_tests , as well as individual PRs like this one) we have been getting an intermittent 500 on pushing images to the internal registry.

When examining the registry logs, we see messages like the ones noted in my prior comment at times corresponding to the push failures in the builder logs. However, while the message implies a disk space / quota issue, our examination of the file system shows plenty of space.

Are the logs for the last test run sufficient for your team to provide some insight? Or do we need to re-run with some additional debug?

thanks

@pweil-
Contributor

pweil- commented Apr 25, 2016

I believe that @soltysh added some additional logging specifically for docker. @soltysh can you take a look here?

@soltysh
Contributor

soltysh commented Apr 25, 2016

Looking...

@gabemontero gabemontero force-pushed the buildExtTestDbg branch 2 times, most recently from 26c0262 to cc744a9 on April 26, 2016 15:04
@gabemontero
Contributor Author

hey @soltysh - have you had a chance to repro / look at this any? Is there some debug you would like me to add to the PR and retest?

@soltysh
Contributor

soltysh commented Apr 28, 2016

@gabemontero sorry, didn't get to it yet. I wanted to finish job-related stuff first. Hopefully I'll have some time tomorrow to look at it.

@gabemontero
Contributor Author

Cool, no worries @soltysh


@gabemontero
Contributor Author

So a little internet research turned up a docker support thread that, while unresolved, shows some similarities to what we are seeing: https://meta.discourse.org/t/disk-quota-exceeded-when-chown-ing/36500/20

Based on this and a couple of other similar hits, I'm going to add a call to quota along with the df usage in the debug I added and see if it sheds any light.

If not, @csrwng had the suggestion that maybe selinux is playing a part. Perhaps I'll see about disabling it in the PR test runs and see if the behavior changes.
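Roughly, the extra checks being considered here would look something like the following sketch (assumes the quota binary is available in the registry image, and REGISTRY_POD resolved as in the earlier sketch), not the exact debug code in the PR:

```bash
# Resolve the registry pod name (label selector is an assumption).
REGISTRY_POD=$(oc get pods -l deploymentconfig=docker-registry \
  -o jsonpath='{.items[0].metadata.name}')

# Report any user quotas visible inside the registry container.
oc exec "$REGISTRY_POD" -- quota -v

# Check whether SELinux is enforcing on the test host.
getenforce
```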

@gabemontero gabemontero force-pushed the buildExtTestDbg branch 6 times, most recently from 6939a7c to 3057cde on May 2, 2016 21:13
@gabemontero
Contributor Author

Things attempted / found today:

  1. after adding the quota command to the openshift registry image, running it in the registry container after the error provides no data; it simply returns
  2. running getenforce on the AWS instance hosting the tests reports it is already in "Permissive" mode
  3. I can manually create directories in the /registry directory in the docker registry pod via mkdir after the error occurs
  4. However, an attempt to create a directory in /registry/docker/registry/v2/repositories after the error occurs via mkdir in the registry pod results in the same disk quota exceeded error.
  5. I even hacked docker to attempt a mkdir when the golang os.MkdirAll failed, and got the same disk quota exceeded error.

Still think I need to figure out how to get a report of quota-level stats and/or config in the registry container ... might next look at dumping the contents of /etc/mtab or /proc/self/mounts (if either in fact exists in the registry container) or whatever the right spot is in this scenario.
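One way that next round of inspection could look, assuming those files exist inside the registry container (a sketch, not the committed debug code):

```bash
# Resolve the registry pod name (label selector is an assumption).
REGISTRY_POD=$(oc get pods -l deploymentconfig=docker-registry \
  -o jsonpath='{.items[0].metadata.name}')

# How the /registry volume is mounted as seen from inside the container.
oc exec "$REGISTRY_POD" -- cat /proc/self/mounts

# Fallback if /proc/self/mounts is not informative.
oc exec "$REGISTRY_POD" -- cat /etc/mtab
```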

@gabemontero gabemontero force-pushed the buildExtTestDbg branch 2 times, most recently from dfdde46 to bcdcf06 on May 4, 2016 19:14
@gabemontero
Contributor Author

Some more updates:

  • the inconsistency between being able to create directories in /registry vs. /registry/docker/registry/v2/repositories ... most likely I misinterpreted the earlier results; consistent behavior across an FS makes more sense.
  • not entirely confident that selinux is not coming into play ... based on this doc I've tried examining /var/log/audit and running ausearch (both from the AWS host and from within the registry container, see the sketch below), but no errors turned up.
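The audit-log check referenced in that bullet amounts to something like the following, run on the AWS host and assuming auditd is active:

```bash
# Look for recent SELinux AVC denials that could explain the quota-like failure.
sudo ausearch -m avc -ts recent

# Or grep the raw audit log directly.
sudo grep denied /var/log/audit/audit.log
```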

Definitely feel like we've moved past my linux/AWS expertise and possibly to the point of diminishing returns in trying to uncover the disk quota exceeded behavior.

Before punting and opening an issue/bugzilla to officially track this with the platform mgmt team, I'm going to try again to look at pruning images from the registry between tests and see if that keeps us under this mysterious disk quota.

@smarterclayton
Contributor

Note that we run conformance tests with an XFS quota on empty dir.
@dgoodwin


@gabemontero
Contributor Author

Thanks for the confirmation. Still wonder why the 'quota' cmd returns blank. A quick search on xfs quota turned up 'report' as well. I'll try adding an invocation of that.


@gabemontero
Contributor Author

Well, xfs_quota -xc report at least returns something at the end of the run when executed on the AWS test host, though nothing that would indicate disk quotas being defined or contention on said resources:

Group quota on /mnt/openshift-xfs-vol-dir (/dev/mapper/docker--vg-openshift--xfs--vol--dir)
                     Blocks
Group ID         Used   Soft   Hard  Warn/Grace
ec2-user            0      0      0  00 [--------]

Group quota on /mnt/openshift-xfs-vol-dir (/dev/mapper/docker--vg-openshift--xfs--vol--dir)
                     Blocks
Group ID         Used   Soft   Hard  Warn/Grace
ec2-user            0      0      0  00 [--------]

Running the same command from the extended test when the error is reported returns nothing.

@dgoodwin @smarterclayton do we perhaps have some sort of xfs_quota or docker emptydir problem?
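One more check that might narrow down whether this is an XFS group-quota / emptyDir interaction (a sketch, run on the AWS host; mount point taken from the report above):

```bash
# Confirm whether group-quota accounting/enforcement is on for the volume.
mount | grep openshift-xfs-vol-dir        # look for gquota/grpquota options
sudo xfs_quota -x -c 'state -g' /mnt/openshift-xfs-vol-dir

# Full per-group usage report with human-readable block counts.
sudo xfs_quota -x -c 'report -gh' /mnt/openshift-xfs-vol-dir
```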

@gabemontero gabemontero force-pushed the buildExtTestDbg branch 3 times, most recently from 76db393 to d01f79d on May 6, 2016 16:34
@gabemontero
Contributor Author

@bparees @dgoodwin - PTAL at the proposed changes, I'd like to get these merged and move on to the next set of issues with the build/images extended tests.

On the xfs_quota-related change, I backed off a bit on the size of the increase (went from 256 Mi to 896 Mi rather than the full 1 Gi the ec2 instances have allocated), so in theory we should still see the quota stop things before the FS is actually fully consumed.
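For illustration only, a group block-quota bump of that shape would look roughly like this; the group name below is a hypothetical placeholder, and the real change lives in the extended-test setup scripts, which may set the quota differently:

```bash
# Hypothetical: raise the soft/hard block quota to 896Mi for the group the
# emptyDir volume is charged against, leaving headroom under the ~1Gi volume.
sudo xfs_quota -x -c 'limit -g bsoft=896m bhard=896m somegroup' /mnt/openshift-xfs-vol-dir
```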

Also, @bparees @csrwng - the 2 tests with the internal git server in gitauth.go have now failed a couple of times in a row with the same error since we moved past the disk quota issue, and these are the only build extended tests that failed. That will be the next area of focus for my debugging, but I seem to recall from scrum some debug activity with the internal git server. Are my recollections correct / are there known issues that would affect these extended tests? If not, some background on why the URLs are constructed the way they are might be helpful.

thanks

@gabemontero
Contributor Author

I can reproduce the gitauth.go extended test errors locally.

@@ -48,25 +49,56 @@ func DumpBuildLogs(bc string, oc *CLI) {
fmt.Fprintf(g.GinkgoWriter, "\n\n got error on bld logs %v\n\n", err)
}

ExamineDiskUsage()
// if we suspect that we are filling up the registry file system, call ExamineDiskUsage / ExaminePodDiskUsage
// also see if manipulations of the quota around /mnt/openshift-xfs-vol-dir exist in the extended test setup scripts
Contributor

this would be one case where i'd leave the call in place, commented out.

Contributor Author

add calls commented out

@bparees
Contributor

bparees commented May 6, 2016

@gabemontero one nit and then i'd be ok w/ merging this if that's what you want.

@openshift-bot
Contributor

Evaluated for origin testonlyextended up to 5be0a98

@gabemontero
Contributor Author

@bparees yeah, let's go ahead and merge ... aside from wanting to see the results for the build extended tests on the official job, I'm curious about the delta in behavior on the images extended tests (something seems off with my PR for the images side).

We'll work these gitauth.go failures in a separate PR. They show up locally anyway, so they are just broken rather than being impacted by the env.

@csrwng
Contributor

csrwng commented May 6, 2016

The problem with the gitserver tests is that they rely on the router and .xip.io to test ssl auth:
F0506 17:19:14.732361 1 builder.go:204] Error: build error: fatal: unable to access 'https://gitserver.172.18.2.53.xip.io/ruby-hello-world/': Could not resolve host: gitserver.172.18.2.53.xip.io; Name or service not known

It seems that .xip.io resolution has been flakier than usual lately
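For what it's worth, a quick check that would confirm name resolution (rather than the route or certificate) is what fails; hostname taken from the failing build log above:

```bash
# Both of these will fail intermittently if xip.io itself is flaking,
# even when the router and the route are healthy.
getent hosts gitserver.172.18.2.53.xip.io
dig +short gitserver.172.18.2.53.xip.io
```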

@gabemontero
Contributor Author

Yep that was where I was at ... where I was headed was simply pulling the IP/port from the service and be done with it.

Or do we really care about the name resolution?


@csrwng
Contributor

csrwng commented May 6, 2016

We need it if we want to test secured communication to the server (via SSL)... only possible through the router currently

@gabemontero
Contributor Author

The router logs have this (and something similar for the gitserver-tokenauth service):

W0505 21:01:11.074032 1 router.go:572] a edge terminated route with host gitserver.172.18.5.125.xip.io does not have the required certificates. The route will still be created but no certificates will be written

Don't know yet if this is benign or not.

@bparees
Contributor

bparees commented May 6, 2016

lgtm, thanks for the deep dive on this @gabemontero
[merge]

@openshift-bot
Contributor

openshift-bot commented May 6, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5841/) (Image: devenv-rhel7_4131)

@openshift-bot
Contributor

Evaluated for origin merge up to 5be0a98

@openshift-bot
Contributor

[Test]ing while waiting on the merge queue

@openshift-bot
Contributor

Evaluated for origin test up to 5be0a98

@gabemontero
Contributor Author

NP @bparees

On the route, based on the comments on this method, we might possibly expect a connection error, not a resolve-host error.

Going to hit pause for now, wait for this PR to merge, and then see about pulling in the networking team / brainstorming during scrum on Monday.

@csrwng
Contributor

csrwng commented May 6, 2016

@gabemontero If what you're seeing is this:
https://github.com/openshift/origin/blob/master/pkg/router/template/router.go#L570
that is expected. The default route certificate is a wildcard certificate. I don't think that's the issue. The problem is that the .xip.io address is not getting resolved. The fix would be to come up with an alternative, more stable dns server. I'll investigate if we can bring up our own proxy server that resolves the one host we need.

@gabemontero
Contributor Author

@csrwng yep that is where I was landing wrt that certificate based log. The use of the openshift proxy server as an alternative sounds cool. Consider the baton officially passed to you then until you get a chance to investigate :-)

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/testonlyextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/162/) (Extended Tests: core(builds))

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/3654/)

@openshift-bot openshift-bot merged commit 91e2f0b into openshift:master May 7, 2016
@gabemontero gabemontero deleted the buildExtTestDbg branch May 7, 2016 16:23
@soltysh
Contributor

soltysh commented May 9, 2016

@gabemontero awesome work!!! Thank you! And sorry I wasn't able to get into that...

@gabemontero
Contributor Author

Thanks @soltysh - and no worries - it was educational ;-)
